Add torchcompile_xentropy executor #1655
Conversation
This executor is meant for fusions with a
@IvanYashchuk Adding cross entropy does indeed not help with RoPE, but it allows Thunder to use a very efficient fused cross-entropy Triton kernel whose performance currently cannot be matched by the other executors (not even APEX). From lines 211 and 212, it seems that while the set of ops was originally chosen for RoPE, the intent was that other ops could be added in the future, making this a fusion executor that comes to the rescue of nvFuser when it can improve performance. Do you think it would be better to create a separate executor entry for Inductor cross entropy only?
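To make the trade-off concrete, here is a toy, self-contained sketch of how op-claiming behaves when cross-entropy ops live in a separate executor entry rather than the existing `torchcompile_cat` list. All executor names and op names below are illustrative stand-ins, not Thunder's actual API or symbol names.

```python
# Toy model of executor op-claiming: each fusion executor advertises a set of
# ops it supports; an op in a trace is claimed by the first executor whose
# set contains it. Names here are illustrative, not Thunder's real API.

TORCHCOMPILE_CAT_OPS = {"cat", "slice", "transpose", "embedding"}
TORCHCOMPILE_XENTROPY_OPS = {"log_softmax", "nll_loss", "take_along_axis"}

def claim_ops(trace, executors):
    """Map each op name in `trace` to the first executor that supports it."""
    assignment = {}
    for op in trace:
        for name, supported in executors:
            if op in supported:
                assignment[op] = name
                break
        else:
            assignment[op] = "fallback"  # e.g. nvFuser or eager
    return assignment

trace = ["embedding", "cat", "log_softmax", "nll_loss"]
executors = [
    ("torchcompile_cat", TORCHCOMPILE_CAT_OPS),
    ("torchcompile_xentropy", TORCHCOMPILE_XENTROPY_OPS),
]
print(claim_ops(trace, executors))
```

With a separate entry, the cross-entropy ops are claimed independently of the RoPE-oriented op list, which is the design question being raised above.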
Reducing the model size to speed up CI and make the test work in environments with constrained memory. I think this change should be fine, since this test verifies functionality more than memory footprint. Let me know what you think. With this change the peak memory of the test is ~2.6 GB instead of ~14 GB.
I have a hard time seeing another torch.compile executor as the solution here.
I'd piggyback on this explanation of the differences, in lightning-thunder/thunder/executors/torch_compile.py, lines 209 to 215 at 880419d.
An issue I see is that we don't strictly enforce this rule, so in cases where the user specifies
My main issue with this is that we should really have one torchcompile fusion executor, in order not to split what could be a big fusion. The solution would be to separate the marking of things to be fused by the torchcompile executor from the fusing itself. But let's have that later.
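The "mark first, fuse contiguous runs later" idea can be sketched with a toy partitioner: a single pass marks which ops are fusable, then consecutive marked ops are grouped into one fusion region, so a large fusable run is never split between two executors. This is an illustrative sketch, not Thunder's actual fusion pass.

```python
# Sketch of a single fusion executor that first marks fusable ops, then
# groups *contiguous* marked ops into single fusion regions. Illustrative
# only; names are hypothetical, not Thunder's real internals.
from itertools import groupby

def fusion_regions(trace, fusable):
    """Group consecutive fusable ops into fusion regions; keep others solo."""
    regions = []
    for is_fusable, ops in groupby(trace, key=lambda op: op in fusable):
        ops = list(ops)
        if is_fusable:
            regions.append(("torchcompile_fusion", ops))
        else:
            regions.extend(("eager", [op]) for op in ops)
    return regions

fusable = {"cat", "slice", "log_softmax", "nll_loss"}
trace = ["embedding", "cat", "slice", "matmul", "log_softmax", "nll_loss"]
# cat+slice form one fusion region, log_softmax+nll_loss another
print(fusion_regions(trace, fusable))
```

Because grouping happens after marking, adding new ops (such as cross entropy) to the fusable set simply extends existing regions instead of creating a competing executor.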
Thank you @riccardofelluga
What does this PR do?
In an attempt to fix #1654 and partially #1552, this PR adds the necessary ops to the torchcompile_cat list to capture HF CausalLMLoss. Before merging, this PR needs testing with other models; I will post a message with the benchmark results.
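The PR's goal can be illustrated with a small coverage check: after extending the op list, a CausalLMLoss-style trace should be fully claimable by the fusion executor rather than falling back op by op. The op names below are illustrative stand-ins, not Thunder's real symbol names, and the added-op set is an assumption for the sketch.

```python
# Hypothetical check of the PR's goal: with the extended op list, a
# CausalLMLoss-style trace is fully capturable in one fusion region.
# Op names are illustrative stand-ins, not Thunder's real symbol names.

BASE_OPS = {"cat", "slice", "transpose"}
XENTROPY_OPS = {"reshape", "log_softmax", "nll_loss"}  # illustrative additions

def fully_captured(trace, supported):
    """True if every op in the trace is claimed by the fusion executor."""
    return all(op in supported for op in trace)

causal_lm_loss_trace = ["reshape", "log_softmax", "nll_loss"]
print(fully_captured(causal_lm_loss_trace, BASE_OPS))                 # False
print(fully_captured(causal_lm_loss_trace, BASE_OPS | XENTROPY_OPS))  # True
```

The benchmark runs mentioned above would then confirm that capturing the whole loss in one region actually translates into a speedup across models.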